
Final assignment - Gyan Dookie


Final Assignment

Overview

We live in a turbulent world. One socially destabilizing and publicly debated phenomenon is the flow of refugees and asylum seekers from the Middle East to Europe. In this IODS final assignment my aim is to explore possible relations among variables that are, in my view, connected to the ongoing so-called “refugee crisis”. I’ll focus on the countries that are bombing the countries from which most of the refugees originate. Here are some questions that interest me. How many refugees do the “bombing” countries have or accept inside their borders? Does taking part in foreign wars increase the risk of terror attacks? How are these warring countries connected to the arms trade? How successful are these countries economically and socially? My hypothesis is still vague at this point, but it could be put bluntly: “Rich countries that are main players in the arms business are bombing poor countries and by doing so are increasing refugee flows across their borders and the risk of terrorist attacks.” This would nevertheless likely oversimplify reality.

In the following data analysis I’m going to perform Principal Component Analysis (PCA) on the ref dataset. Before the actual analysis I’ll go through some preliminary explorations of the data.

It’s a good idea to introduce the variables of the ref dataset before moving further.

  • RefAsyl = The number of refugees per 1000 inhabitants in the asylum country (2015)
  • ArmsTra = The amount of the country’s arms exports (2015)
  • BombCtry = How many countries the country bombed in 2015-2016
  • TerAtt = How many Islamist terror attacks there were in the country in 2015-2017
  • GNI = Gross National Income per capita (2015)
  • FSI = The Fragile state index

Here is the link to my data wrangling file.

1. Preliminary explorations of the data

1.1 The structure and the dimensions of the data

Below is the structure of the human dataset. The following characteristics of the dataframe can be discerned.

  • 155 rows
  • 8 variables
## [1] 155   8
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

1.2 The summary of the data

Here we’ll print out the summary of the data with the summary() function to get a grasp of the min, max, median, mean and quantiles of the data.

##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Here are the standard deviations of the variables.

##   sd(Edu2.FM) sd(Labo.FM) sd(Edu.Exp) sd(Life.Exp)  sd(GNI) sd(Mat.Mor)
## 1   0.2416396   0.1987786    2.840251     8.332064 18543.85    211.7896
##   sd(Ado.Birth) sd(Parli.F)
## 1      41.11205    11.48775
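
The summary and the standard deviations can be reproduced column by column. The following is a minimal sketch, not the original code: `human_head` is a hypothetical stand-in that holds only the first ten rows of three variables, copied from the str() output above, so its numbers will differ from the full 155-row results.

```r
# First ten rows of three variables, copied from the str() output above.
# `human_head` is a hypothetical stand-in for the full `human` data frame.
human_head <- data.frame(
  Life.Exp = c(81.6, 82.4, 83, 80.2, 81.6, 80.9, 80.9, 79.1, 82, 81.8),
  GNI      = c(64992, 42261, 56431, 44025, 45435, 43919, 39568, 52947, 42155, 32689),
  Mat.Mor  = c(4, 6, 6, 5, 6, 7, 9, 28, 11, 8)
)

# Min, quartiles, median, mean and max for every column
print(summary(human_head))

# Standard deviation of every column, returned as a named numeric vector
sds <- sapply(human_head, sd)
print(round(sds, 2))
```

Even in this small excerpt the standard deviation of GNI is several orders of magnitude larger than the others, which matters for the PCA later on.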

1.3 The graphical overview of the summarized data

Let’s visualize our data to get a better overall picture of it. First we’ll produce a scatterplot matrix with base R’s pairs() and then with the GGally package’s ggpairs().

1.4 The correlations of the data with cor()

Now it’s time to produce a table of the correlations with the cor() function. Here the correlations were rounded to two decimals to save space.

##           Edu2.FM Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01    0.59     0.58  0.43   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    0.05    -0.14 -0.02    0.24      0.12    0.25
## Edu.Exp      0.59    0.05    1.00     0.79  0.62   -0.74     -0.70    0.21
## Life.Exp     0.58   -0.14    0.79     1.00  0.63   -0.86     -0.73    0.17
## GNI          0.43   -0.02    0.62     0.63  1.00   -0.50     -0.56    0.09
## Mat.Mor     -0.66    0.24   -0.74    -0.86 -0.50    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12   -0.70    -0.73 -0.56    0.76      1.00   -0.07
## Parli.F      0.08    0.25    0.21     0.17  0.09   -0.09     -0.07    1.00
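
A table like the one above comes from calling cor() on the data frame and rounding the result. Here is a small sketch under the assumption that the data are all numeric; `human_head` is a hypothetical excerpt (first ten rows of six variables, copied from the str() output), so the correlations it yields are not those of the full dataset.

```r
# Hypothetical excerpt: first ten rows of six variables from the str() output
human_head <- data.frame(
  Edu.Exp   = c(17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9, 19.2),
  Life.Exp  = c(81.6, 82.4, 83, 80.2, 81.6, 80.9, 80.9, 79.1, 82, 81.8),
  GNI       = c(64992, 42261, 56431, 44025, 45435, 43919, 39568, 52947, 42155, 32689),
  Mat.Mor   = c(4, 6, 6, 5, 6, 7, 9, 28, 11, 8),
  Ado.Birth = c(7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31, 14.5, 25.3),
  Parli.F   = c(39.6, 30.5, 28.5, 38, 36.9, 36.9, 19.9, 19.4, 28.2, 31.4)
)

# Pearson correlations between all pairs of columns, rounded to two decimals
cor_mat <- round(cor(human_head), 2)
print(cor_mat)
```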

1.5 The graphical overview of correlations with the advanced corrplot() function

Here’s the visualization of the correlation matrix with the corrplot() function. To reduce repetition, we’ll visualize only the upper triangle of the plot: a correlation matrix is symmetric, so the upper triangle contains the same correlations as the lower one.
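
The symmetry argument can be checked on a small piece of the correlation table. This sketch uses three variables with values copied from the table in section 1.4 and blanks out the lower triangle by hand; it only illustrates the idea (corrplot() offers a `type = "upper"` argument for the plot itself), and the names `cor_sub` and `upper_only` are made up for the example.

```r
# 3 x 3 excerpt of the correlation table (values copied from section 1.4)
vars <- c("Edu.Exp", "Life.Exp", "Mat.Mor")
cor_sub <- matrix(
  c( 1.00,  0.79, -0.74,
     0.79,  1.00, -0.86,
    -0.74, -0.86,  1.00),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# A correlation matrix equals its own transpose
print(isSymmetric(cor_sub))

# Keep the diagonal and the upper triangle, drop the redundant lower triangle
upper_only <- cor_sub
upper_only[lower.tri(upper_only)] <- NA
print(upper_only)
```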

1.6 The above summaries and visualizations showed the following

Here are some of the exposed correlations

  • Strong positive correlations between the following variable pairs
    • Edu.Exp : Life.Exp
    • Mat.Mor : Ado.Birth
  • Moderate positive correlation between the following variable pairs
    • Edu.Exp : Edu2.FM
    • Life.Exp : Edu2.FM
    • Life.Exp : GNI
    • GNI : Edu.Exp
  • Strong negative correlation between the following variable pairs
    • Life.Exp : Mat.Mor
    • Mat.Mor : Edu.Exp
  • Moderate negative correlation between the following variable pairs
    • Ado.Birth : Edu2.FM
    • GNI : Ado.Birth
  • Minimal or zero correlation between the following variable pairs
    • GNI : Labo.FM
    • GNI : Parli.F
    • Parli.F : Edu2.FM
    • Labo.FM : Edu.Exp

2. Performing the Principal component analysis (PCA) on the human data

2.1 PCA on non-standardized data

Next we’ll perform principal component analysis (PCA) on the non-standardized human data and show the percentage of variance captured by each principal component.

## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0

Then we’ll draw a biplot displaying the observations by the first two principal components (PC1 coordinate on the x-axis, PC2 coordinate on the y-axis), along with arrows representing the original variables.

There’s something wrong with this PCA and its plot. The first principal component (PC1) explains 100 % of the variance and the remaining components (PC2-PC8) explain 0 %. The only variable name shown is GNI, which is aligned with the first principal component. PCA is sensitive to the relative scaling of the original features: it treats features with larger variance as more important than features with smaller variance. The GNI variable of the human dataset is measured on a far larger scale than the other variables and therefore has a far larger variance (its standard deviation is about 18 544, while the next largest, that of Mat.Mor, is about 212; the long arrow reflects this). This is why the PCA on non-standardized variables failed.
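
The dominance of GNI can be made concrete with the standard deviations printed in section 1.2. In an unscaled PCA the total variance is the sum of the variables’ variances, so a variable that holds almost all of that sum will drag PC1 onto itself. The sketch below only redoes this arithmetic; the vector `sds` is typed in from the output above.

```r
# Standard deviations copied from the output in section 1.2
sds <- c(Edu2.FM = 0.2416396, Labo.FM = 0.1987786, Edu.Exp = 2.840251,
         Life.Exp = 8.332064, GNI = 18543.85, Mat.Mor = 211.7896,
         Ado.Birth = 41.11205, Parli.F = 11.48775)

# Variance of each variable and its share of the total variance
variances <- sds^2
share <- variances / sum(variances)

# Shares in percent; GNI accounts for practically all of the total
print(round(100 * share, 3))
```

With shares like these, a PC1 that rounds to 100 % is what one would expect rather than a sign of a coding mistake.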

Let’s fix this problem by standardizing the data before using it in the principal component analysis.

2.2 PCA on standardized/scaled data

Here are the summaries of the scaled variables. Note how the variables have changed (e.g. the means are now all zero). As we can see below, the explained variance is now spread more evenly among the PCs. The PCA plot now also makes a lot more sense.

##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
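
The effect of standardizing can be sketched with base R’s scale() and prcomp(). This is an illustration under assumptions, not the original analysis: `human_head` is a hypothetical excerpt (first ten rows of six variables from the str() output) and `pct_var()` is a helper defined here, so the percentages differ from the 53.6, 16.2, ... shown above. The last lines check two values of the scaled summary by hand, using the mean, minimum, maximum and standard deviation of GNI printed in section 1.2.

```r
# Hypothetical excerpt: first ten rows of six variables from the str() output
human_head <- data.frame(
  Edu.Exp   = c(17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9, 19.2),
  Life.Exp  = c(81.6, 82.4, 83, 80.2, 81.6, 80.9, 80.9, 79.1, 82, 81.8),
  GNI       = c(64992, 42261, 56431, 44025, 45435, 43919, 39568, 52947, 42155, 32689),
  Mat.Mor   = c(4, 6, 6, 5, 6, 7, 9, 28, 11, 8),
  Ado.Birth = c(7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31, 14.5, 25.3),
  Parli.F   = c(39.6, 30.5, 28.5, 38, 36.9, 36.9, 19.9, 19.4, 28.2, 31.4)
)

# Helper (defined for this sketch): percentage of variance per component
pct_var <- function(pca) round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)

pca_raw <- prcomp(human_head)          # raw units: GNI dominates
pca_std <- prcomp(scale(human_head))   # each column: mean 0, sd 1

print(pct_var(pca_raw))
print(pct_var(pca_std))

# Hand check of the scaled GNI summary: z = (x - mean) / sd
z_min <- (581 - 17628) / 18543.85      # compare with Min. -0.9193
z_max <- (123124 - 17628) / 18543.85   # compare with Max.  5.6890
print(round(c(z_min, z_max), 4))
```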

Let’s take a closer look at the countries on the “west side” of the biplot and close to PC2.

2.3 Interpretation and analysis of PCA and the corresponding biplots

Let’s interpret the results of both analyses and their corresponding biplots. The biplot that was plotted from the non-standardized data (the one with the blue arrow) was not very informative, as we learnt above. The second biplot, based on the standardized variables, offers much more interesting and visible information.

  • Parli.F and Labo.FM variables
    • The angle between these variables is quite small (about 30 degrees), which suggests a positive correlation. The correlation table gives only 0.25, so the two-dimensional biplot overstates it
    • The arrows point upwards, along the vertical PC2 axis rather than the horizontal PC1 axis. This shows that Parli.F and Labo.FM are associated with PC2 and hardly at all with PC1
    • Countries in this group include Rwanda and Tanzania
  • Mat.Mor and Ado.Birth
    • Mat.Mor and Ado.Birth have a strong positive correlation with each other
    • Countries in this group include for example Côte d’Ivoire, Sierra Leone and Burkina Faso
  • Edu.Exp, Life.Exp, Edu2.FM and GNI
    • All these 4 variables correlate positively with each other (moderately to strongly)
    • Their arrows lie along the horizontal axis, so they are associated with PC1
    • Countries in this group include for example Korea (Republic), Venezuela, Japan, Bosnia, the Czech Republic, Singapore, Ireland and the United States
  • Mat.Mor and Ado.Birth have a strong negative correlation with Edu.Exp, Life.Exp, Edu2.FM and GNI
    • From this you can conclude, for example, that countries with higher GNI tend to have lower Mat.Mor
  • Parli.F and Labo.FM have close to zero correlation with the other variables and with PC1
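
Reading correlations from arrow angles rests on the rule that the cosine of the angle between two arrows approximates the correlation between the two variables. The approximation is only as good as the share of variance shown in the plot, here 53.6 % + 16.2 % = 69.8 %. As a rough cross-check, the sketch below converts three correlations from the table in section 1.4 into the angles they would correspond to if nothing were lost in the projection; the pair labels are made up for the example.

```r
# Correlations copied from the table in section 1.4
r <- c("Parli.F ~ Labo.FM"   =  0.25,
       "Mat.Mor ~ Ado.Birth" =  0.76,
       "Life.Exp ~ Mat.Mor"  = -0.86)

# Angle (in degrees) whose cosine equals each correlation
angle_deg <- acos(r) * 180 / pi
print(round(angle_deg, 1))
```

A correlation of 0.25 corresponds to a much wider angle than the roughly 30 degrees seen between Parli.F and Labo.FM in the biplot, so angles in a two-dimensional biplot are best read together with the correlation table.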

Interpretation of PC1

Generally speaking, the first principal component captures the maximum amount of variance from the features in the original data. Here PC1 captures 53.6 % of the variance. The variables connected to the PC1 dimension are Mat.Mor (maternal mortality) and Ado.Birth (adolescent birth rate), whose arrows point horizontally to the right, and Edu.Exp, Life.Exp, Edu2.FM and GNI, whose arrows point horizontally to the left (Mat.Mor and Ado.Birth have a strong negative correlation with Edu.Exp, Life.Exp, Edu2.FM and GNI, as explained above). The countries at the right end of the horizontal PC1 axis are mostly poor African countries with low values on the education-related variables; at the opposite (left) end are rich European and Asian countries (plus the USA) with high values on those variables.

Interpretation of PC2

The 2nd principal component PC2 is orthogonal to the first and it captures the maximum amount of variability/variance left. Here that amount is 16.2 %. PC2 describes how actively women take part in the political sphere and the working life of the society they live in. Many Arab states are located at the low end of the vertical PC2 axis shown in the plot.

3. Performing the Multiple Correspondence Analysis (MCA) on the tea data

Next we’ll load the tea dataset from the FactoMineR package and explore the data briefly.

3.1 The structure and dimensions of the data

Let’s look at the structure and the dimensions of the data first. Then we’ll create a subset of it by selecting the following variables.

  • Sport
  • effect.on.health
  • sophisticated
  • spirituality
  • friends
  • sex
## [1] "Sport"            "effect.on.health" "sophisticated"   
## [4] "spirituality"     "friends"          "sex"
## [1] 300  36
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
##  [1] "breakfast"        "tea.time"         "evening"         
##  [4] "lunch"            "dinner"           "always"          
##  [7] "home"             "work"             "tearoom"         
## [10] "friends"          "resto"            "pub"             
## [13] "Tea"              "How"              "sugar"           
## [16] "how"              "where"            "price"           
## [19] "age"              "sex"              "SPC"             
## [22] "Sport"            "age_Q"            "frequency"       
## [25] "escape.exoticism" "spirituality"     "healthy"         
## [28] "diuretic"         "friendliness"     "iron.absorption" 
## [31] "feminine"         "sophisticated"    "slimming"        
## [34] "exciting"         "relaxing"         "effect.on.health"

3.2 The structure and the summary of the subsetted data

## 'data.frame':    300 obs. of  6 variables:
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##            Sport                effect.on.health           sophisticated
##  Not.sportsman:121   effect on health   : 66     Not.sophisticated: 85  
##  sportsman    :179   No.effect on health:234     sophisticated    :215  
##            spirituality        friends    sex    
##  Not.spirituality:206   friends    :196   F:178  
##  spirituality    : 94   Not.friends:104   M:122

3.3 The visual overview of the data

3.4 Performing the Multiple Correspondence analysis on the data

Let’s do the multiple correspondence analysis of selected tea variables.

## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.225   0.175   0.169   0.159   0.140   0.132
## % of var.             22.474  17.492  16.890  15.938  14.029  13.176
## Cumulative % of var.  22.474  39.967  56.857  72.795  86.824 100.000
## 
## Individuals (the 10 first)
##                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                   |  0.941  1.315  0.730 |  0.294  0.165  0.071 | -0.407
## 2                   |  0.561  0.468  0.290 | -0.123  0.029  0.014 | -0.554
## 3                   |  0.389  0.225  0.176 | -0.691  0.909  0.555 | -0.367
## 4                   | -0.333  0.165  0.087 |  1.007  1.933  0.791 | -0.108
## 5                   |  0.599  0.532  0.239 |  0.722  0.993  0.347 | -0.402
## 6                   |  0.941  1.315  0.730 |  0.294  0.165  0.071 | -0.407
## 7                   |  0.769  0.878  0.599 | -0.273  0.142  0.076 | -0.220
## 8                   |  0.069  0.007  0.007 |  0.119  0.027  0.019 | -0.392
## 9                   |  0.449  0.300  0.235 |  0.536  0.548  0.335 | -0.245
## 10                  |  0.501  0.373  0.186 |  0.337  0.216  0.084 | -0.274
##                        ctr   cos2  
## 1                    0.326  0.136 |
## 2                    0.605  0.282 |
## 3                    0.266  0.157 |
## 4                    0.023  0.009 |
## 5                    0.318  0.107 |
## 6                    0.326  0.136 |
## 7                    0.095  0.049 |
## 8                    0.304  0.211 |
## 9                    0.119  0.070 |
## 10                   0.149  0.056 |
## 
## Categories (the 10 first)
##                         Dim.1     ctr    cos2  v.test     Dim.2     ctr
## Not.sportsman       |  -0.747  16.695   0.377 -10.621 |   0.064   0.158
## sportsman           |   0.505  11.285   0.377  10.621 |  -0.043   0.107
## effect on health    |   0.342   1.912   0.033   3.143 |  -0.007   0.001
## No.effect on health |  -0.097   0.539   0.033  -3.143 |   0.002   0.000
## Not.sophisticated   |   1.003  21.145   0.398  10.907 |  -0.436   5.129
## sophisticated       |  -0.397   8.359   0.398 -10.907 |   0.172   2.028
## Not.spirituality    |   0.305   4.742   0.204   7.811 |  -0.336   7.405
## spirituality        |  -0.669  10.392   0.204  -7.811 |   0.737  16.229
## friends             |  -0.170   1.396   0.054  -4.029 |  -0.494  15.163
## Not.friends         |   0.320   2.631   0.054   4.029 |   0.930  28.577
##                        cos2  v.test     Dim.3     ctr    cos2  v.test  
## Not.sportsman         0.003   0.912 |   0.195   1.506   0.026   2.765 |
## sportsman             0.003  -0.912 |  -0.131   1.018   0.026  -2.765 |
## effect on health      0.000  -0.067 |   1.762  67.420   0.876  16.184 |
## No.effect on health   0.000   0.067 |  -0.497  19.016   0.876 -16.184 |
## Not.sophisticated     0.075  -4.739 |  -0.285   2.271   0.032  -3.099 |
## sophisticated         0.075   4.739 |   0.113   0.898   0.032   3.099 |
## Not.spirituality      0.248  -8.612 |  -0.004   0.001   0.000  -0.096 |
## spirituality          0.248   8.612 |   0.008   0.002   0.000   0.096 |
## friends               0.459 -11.716 |   0.160   1.640   0.048   3.787 |
## Not.friends           0.459  11.716 |  -0.301   3.092   0.048  -3.787 |
## 
## Categorical variables (eta2)
##                       Dim.1 Dim.2 Dim.3  
## Sport               | 0.377 0.003 0.026 |
## effect.on.health    | 0.033 0.000 0.876 |
## sophisticated       | 0.398 0.075 0.032 |
## spirituality        | 0.204 0.248 0.000 |
## friends             | 0.054 0.459 0.048 |
## sex                 | 0.282 0.265 0.032 |

3.5 Analyzing the MCA’s summary table

  • The eigenvalues show the amount of variance captured by the different dimensions. Here we can see the following.
    • Dimension 1 captures 22.5 % of the variance within the data
    • Dimension 2 captures 17.5 % of the variance within the data
    • The first two dimensions together capture approximately 40 % of the variance within the data. This is considerably lower than the share captured by the first two principal components in the previous analysis of the human dataset (69.8 %).
  • The individuals
    • The Individuals part shows the individuals’ coordinates, their contributions (%) to the dimensions and the cos2 values (the squared correlations) on the dimensions.
  • The categories
    • The Categories part shows the coordinates of the variable categories, their contributions (%), the cos2 values (the squared correlations) and the v.test values.
  • The categorical variables (eta2)
    • This part shows the squared correlation between each variable and the dimensions
    • If the value is close to one, it indicates a strong link between the variable and the dimension. Here, among the three dimensions shown, only effect.on.health on Dimension 3 has a value close to one (0.876)
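
The parts of the summary table are linked to each other, which gives a quick consistency check. In MCA the eigenvalue of a dimension equals the mean of the eta2 values of the variables on that dimension, and the total variance (inertia) equals the number of categories divided by the number of variables, minus one. The sketch below redoes this arithmetic with numbers typed in from the output above; the object names are made up for the example.

```r
# eta2 values copied from the "Categorical variables (eta2)" table
eta2 <- rbind(
  Sport            = c(0.377, 0.003, 0.026),
  effect.on.health = c(0.033, 0.000, 0.876),
  sophisticated    = c(0.398, 0.075, 0.032),
  spirituality     = c(0.204, 0.248, 0.000),
  friends          = c(0.054, 0.459, 0.048),
  sex              = c(0.282, 0.265, 0.032)
)
colnames(eta2) <- c("Dim.1", "Dim.2", "Dim.3")

# Eigenvalue of a dimension = mean eta2 over the six variables
eig_from_eta2 <- colMeans(eta2)
eig_printed   <- c(Dim.1 = 0.225, Dim.2 = 0.175, Dim.3 = 0.169)
print(round(eig_from_eta2, 3))

# Total inertia: 12 categories (6 variables x 2 levels) over 6 variables, minus 1
n_categories <- 12
n_variables  <- 6
total_inertia <- n_categories / n_variables - 1

# Percentage of variance per dimension
print(round(100 * eig_printed / total_inertia, 1))
```

Because the total inertia is exactly 1 here, each eigenvalue can be read directly as a proportion of variance, which is why 0.225 appears as about 22.5 % in the table.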

3.6 Visualizing the Multiple Correspondence Analysis with the plot() function

3.7 Some conclusions of the MCA and its plot

  • There’s a clear difference in the social role of tea between females and males. Men prefer drinking tea alone more often than women and women tend to drink tea with friends more often than men.

This proportional barplot confirms that women drink tea more with friends than men do (which was also suggested by the MCA-plot above).

The above MCA-biplot showed that women regard tea drinking as sophisticated more often than men do. This finding is confirmed in the barplot below.



Final Assignment

The Human Development Index dataset was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone. Here, I have wrangled this data into a more concise form and named it “human”. In the following data analysis I’m going to perform Principal Component Analysis (PCA) on the human dataset and then Multiple Correspondence Analysis (MCA) on the tea dataset. Before the actual analysis I’ll go through some preliminary explorations of the data.

It’s a good idea to introduce the variables of the human dataset before moving further.

  • Edu2.FM = Ratio of females to males with at least secondary education
  • Labo.FM = Ratio of females to males in the labour force
  • Edu.Exp = Expected years of schooling
  • Life.Exp = Life expectancy at birth
  • GNI = Gross National Income per capita
  • Mat.Mor = Maternal mortality ratio
  • Ado.Birth = Adolescent birth rate
  • Parli.F = Percentage of female representatives in parliament

Here are the links to the metadata of the Human Development Index dataset.

1. Preliminary explorations of the data

1.1 The structure and the dimensions of the data

Below is the structure of the human dataset. The following characteristics of the dataframe can be discerned.

  • 155 rows
  • 8 variables
## [1] 155   8
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

1.2 The summary of the data

Here we’ll print out the summary of the data with the summary() function to get a grasp of the min, max, median, mean and quantiles of the data.

##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Here are the standard deviations of the variables.

##   sd(Edu2.FM) sd(Labo.FM) sd(Edu.Exp) sd(Life.Exp)  sd(GNI) sd(Mat.Mor)
## 1   0.2416396   0.1987786    2.840251     8.332064 18543.85    211.7896
##   sd(Ado.Birth) sd(Parli.F)
## 1      41.11205    11.48775

1.3 The graphical overview of the summarized data

Let’s visualize our data to get a better overall picture of it. First we’ll produce a matrix plot with the basic package’s pairs() and then with GGally package’s ggpairs().

1.4 The correlations of the data with corrplot()

Now it’s time to produce a table of the correlations with the cor() function. Here the correlations were rounded to two desimals to save space.

##           Edu2.FM Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01    0.59     0.58  0.43   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    0.05    -0.14 -0.02    0.24      0.12    0.25
## Edu.Exp      0.59    0.05    1.00     0.79  0.62   -0.74     -0.70    0.21
## Life.Exp     0.58   -0.14    0.79     1.00  0.63   -0.86     -0.73    0.17
## GNI          0.43   -0.02    0.62     0.63  1.00   -0.50     -0.56    0.09
## Mat.Mor     -0.66    0.24   -0.74    -0.86 -0.50    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12   -0.70    -0.73 -0.56    0.76      1.00   -0.07
## Parli.F      0.08    0.25    0.21     0.17  0.09   -0.09     -0.07    1.00

1.5 The graphical overview of correlations with the advanced corrplot() function

Here’s the visualization of the correlation matrix with the advanced corrplot() function. To reduce repetition, we’ll visualize only the upper part of the plot (as is well known, the top part of the correlation matrix contains the same correlations as the bottom part)

1.6 The above summaries and visualizations showed the following

Here are some of the exposed correlations

  • Strong positive correlations between the following variable pairs
    • Edu.Exp : Life.Exp
    • Mat.Mor : Ado.Birth
  • Moderate positive correlation between the following variable pairs
    • Edu.Exp : Edu2.FM
    • Life.Exp : Edu2.FM
    • Life.Exp : GNI
    • GNI : EduExp
  • Strong negative correlation between the following variable pairs
    • Life.Exp : Mat.Mor
    • Mat.Mor : Edu.Exp
  • Moderate negative correlation between the following variable pairs
    • Ado.Birth : Edu2.FM
    • GNI: Ado.Birth
  • Minimal or zero correlation between the following variable pairs
    • GNI : Labo.FM
    • GNI : Parli.FM
    • Parli.FM : Edu2.FM
    • Labo.FM : Edu.Exp

2. Performing the Principal component analysis (PCA) on the human data

2.1 PCA on non-standardized data

Next we’ll perform principal component analysis (PCA) on the not standardized human data and show variability captured by the principal components.

## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0

Then we’ll draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

There’s something wrong with this PCA and it’s plot. The first principal (PC1) component explain 100% of the variance and the following principal components (PC2-PC8) explain 0 %. The only variable name shown is GNI which is connected to the first principal component. PCA is sensitive to the relative scaling of the original features and assumes that features with larger variance are more important than features with smaller variance. The human dataset’s GNI variable has a radically bigger scale and thus bigger variance than other variables (the long arrow also tells us there’s a quite big stadard variation within this variable). This is why this PCA with non-standardized variables failed miserably.

Let’s fix this problem by standardizing the data before using it in the principal component analysis.

2.2 PCA on standardized/scaled data

Here are the summaries of the scaled variables. See how the variables changed ( e.g. the means are now all at zero). As we can see below, the distribution of explainability is now more spread among the PC’s. The PCA plot also makes now a lot more sense.

##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3

Let’s take a closer look at the countries on the “west side” of the biplot and close to PC2.

2.3 Intepretation and analysis of PCA and the corresponding biplots

Let’s interpret the results of both analysis and their corresponding biblots The biplot that was plotted from the non-standardized data (the one with the blue arrow) was not very informative, as we learnt above. The second biplot based on the standardized variables on the contrary offers a lot of interesting and visible information.

  • Parli.F and Labo.FM variables
    • The angle between these variables is quite small (about 30 degrees), so they are quite strongly positively correlated
    • The arrows point upwards, along the vertical PC2 axis rather than the horizontal PC1 axis. This shows that Parli.F and Labo.FM correlate with PC2 but hardly at all with PC1
    • Countries in this group include Rwanda and Tanzania
  • Mat.Mor and Ado.Birth
    • Mat.Mor and Ado.Birth have a strong positive correlation with each other
    • Countries in this group include for example Côte d’Ivoire, Sierra Leone and Burkina Faso
  • Edu.Exp, Life.Exp, Edu2.FM and GNI
    • All these 4 variables have a strong positive correlation with each other
    • Their arrows point to the left along the horizontal axis, so they correlate negatively with PC1
    • Countries in this group include for example Korea (Republic), Venezuela, Japan, Bosnia, the Czech Republic, Singapore, Ireland and the United States
  • Mat.Mor and Ado.Birth have a strong negative correlation with Edu.Exp, Life.Exp, Edu2.FM and GNI
    • From this you can conclude for example that countries with higher GNI have lower maternal mortality (Mat.Mor)
  • Parli.F and Labo.FM have close to zero correlation with the other variables and with PC1
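The link between arrow angles and correlations can be illustrated with a short sketch. It uses two made-up vectors (`x` and `y` are hypothetical, not variables from the dataset) and relies on the fact that the Pearson correlation equals the cosine of the angle between the centred variables. In a two-dimensional biplot the angle is only an approximation of this, since the plot shows just the first two components.

```r
# Hypothetical example vectors (not taken from the dataset)
set.seed(2)
x <- rnorm(30)
y <- 0.8 * x + rnorm(30, sd = 0.5)

# Centre both variables
xc <- x - mean(x)
yc <- y - mean(y)

# Cosine of the angle between the centred vectors
cos_angle <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))

# The cosine equals the Pearson correlation
all.equal(cos_angle, cor(x, y))

# An angle of 30 degrees corresponds to a correlation of roughly 0.87
cos(30 * pi / 180)
```

By the same logic, arrows at roughly 90 degrees suggest a correlation near zero, and arrows pointing in opposite directions suggest a strong negative correlation.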

Interpretation of PC1

Generally speaking, the 1st principal component captures the maximum amount of variance from the features in the original data. Here PC1 captures 53.6 % of the variance. The variables connected to the PC1 dimension are Mat.Mor (maternal mortality) and Ado.Birth (adolescent birth rate), whose arrows point horizontally to the right, and Edu.Exp, Life.Exp, Edu2.FM and GNI, whose arrows point horizontally to the left (Mat.Mor and Ado.Birth have a strong negative correlation with Edu.Exp, Life.Exp, Edu2.FM and GNI, as I explained above). The countries at the right end of the horizontal PC1 axis are mostly poor African countries with low values on the education-related variables. At the opposite (left) end are rich European and Asian countries (plus the USA) with high values on these variables.

Interpretation of PC2

The 2nd principal component, PC2, is orthogonal to the first and captures the maximum amount of the remaining variance. Here that amount is 16.2 %. PC2 describes how actively women take part in the political sphere and the working life of the society they live in (the Parli.F and Labo.FM variables). Many Arab states are located at the low end of the vertical PC2 axis shown in the plot.

3. Performing the Multiple Correspondence Analysis (MCA) on the tea data

Next we’ll load the tea dataset from the FactoMineR package and explore the data briefly.

3.1 The structure and dimensions of the data

Let’s look at the structure and the dimensions of the data first. Then we’ll create a subset of it by selecting the following variables.

  • Sport
  • effect.on.health
  • sophisticated
  • spirituality
  • friends
  • sex
## [1] "Sport"            "effect.on.health" "sophisticated"   
## [4] "spirituality"     "friends"          "sex"
## [1] 300  36
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
##  [1] "breakfast"        "tea.time"         "evening"         
##  [4] "lunch"            "dinner"           "always"          
##  [7] "home"             "work"             "tearoom"         
## [10] "friends"          "resto"            "pub"             
## [13] "Tea"              "How"              "sugar"           
## [16] "how"              "where"            "price"           
## [19] "age"              "sex"              "SPC"             
## [22] "Sport"            "age_Q"            "frequency"       
## [25] "escape.exoticism" "spirituality"     "healthy"         
## [28] "diuretic"         "friendliness"     "iron.absorption" 
## [31] "feminine"         "sophisticated"    "slimming"        
## [34] "exciting"         "relaxing"         "effect.on.health"
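The subsetting step can be sketched in base R as follows. To keep the example runnable without FactoMineR it uses a tiny stand-in data frame (`tea_demo`, with made-up rows) and only three of the six columns; the names `tea_demo`, `keep_columns` and `tea_sub` are assumptions made for illustration, not necessarily the names used in the analysis.

```r
# Stand-in for the tea data (hypothetical rows). With FactoMineR installed,
# the real data could be loaded with: data(tea, package = "FactoMineR")
tea_demo <- data.frame(
  Sport   = factor(c("sportsman", "Not.sportsman", "sportsman")),
  sex     = factor(c("M", "F", "F")),
  friends = factor(c("Not.friends", "friends", "friends")),
  age     = c(39L, 45L, 47L)
)

# Names of the columns to keep (a shortened version of the list above)
keep_columns <- c("Sport", "friends", "sex")

# Select the columns by name; the result is still a data frame
tea_sub <- tea_demo[, keep_columns]

dim(tea_sub)  # number of rows and columns
str(tea_sub)  # structure: each kept column is a factor
```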

3.2 The structure and the summary of the subsetted data

## 'data.frame':    300 obs. of  6 variables:
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##            Sport                effect.on.health           sophisticated
##  Not.sportsman:121   effect on health   : 66     Not.sophisticated: 85  
##  sportsman    :179   No.effect on health:234     sophisticated    :215  
##            spirituality        friends    sex    
##  Not.spirituality:206   friends    :196   F:178  
##  spirituality    : 94   Not.friends:104   M:122

3.3 The visual overview of the data

3.4 Performing the Multiple Correspondence Analysis on the data

Let’s do the multiple correspondence analysis of selected tea variables.

## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.225   0.175   0.169   0.159   0.140   0.132
## % of var.             22.474  17.492  16.890  15.938  14.029  13.176
## Cumulative % of var.  22.474  39.967  56.857  72.795  86.824 100.000
## 
## Individuals (the 10 first)
##                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                   |  0.941  1.315  0.730 |  0.294  0.165  0.071 | -0.407
## 2                   |  0.561  0.468  0.290 | -0.123  0.029  0.014 | -0.554
## 3                   |  0.389  0.225  0.176 | -0.691  0.909  0.555 | -0.367
## 4                   | -0.333  0.165  0.087 |  1.007  1.933  0.791 | -0.108
## 5                   |  0.599  0.532  0.239 |  0.722  0.993  0.347 | -0.402
## 6                   |  0.941  1.315  0.730 |  0.294  0.165  0.071 | -0.407
## 7                   |  0.769  0.878  0.599 | -0.273  0.142  0.076 | -0.220
## 8                   |  0.069  0.007  0.007 |  0.119  0.027  0.019 | -0.392
## 9                   |  0.449  0.300  0.235 |  0.536  0.548  0.335 | -0.245
## 10                  |  0.501  0.373  0.186 |  0.337  0.216  0.084 | -0.274
##                        ctr   cos2  
## 1                    0.326  0.136 |
## 2                    0.605  0.282 |
## 3                    0.266  0.157 |
## 4                    0.023  0.009 |
## 5                    0.318  0.107 |
## 6                    0.326  0.136 |
## 7                    0.095  0.049 |
## 8                    0.304  0.211 |
## 9                    0.119  0.070 |
## 10                   0.149  0.056 |
## 
## Categories (the 10 first)
##                         Dim.1     ctr    cos2  v.test     Dim.2     ctr
## Not.sportsman       |  -0.747  16.695   0.377 -10.621 |   0.064   0.158
## sportsman           |   0.505  11.285   0.377  10.621 |  -0.043   0.107
## effect on health    |   0.342   1.912   0.033   3.143 |  -0.007   0.001
## No.effect on health |  -0.097   0.539   0.033  -3.143 |   0.002   0.000
## Not.sophisticated   |   1.003  21.145   0.398  10.907 |  -0.436   5.129
## sophisticated       |  -0.397   8.359   0.398 -10.907 |   0.172   2.028
## Not.spirituality    |   0.305   4.742   0.204   7.811 |  -0.336   7.405
## spirituality        |  -0.669  10.392   0.204  -7.811 |   0.737  16.229
## friends             |  -0.170   1.396   0.054  -4.029 |  -0.494  15.163
## Not.friends         |   0.320   2.631   0.054   4.029 |   0.930  28.577
##                        cos2  v.test     Dim.3     ctr    cos2  v.test  
## Not.sportsman         0.003   0.912 |   0.195   1.506   0.026   2.765 |
## sportsman             0.003  -0.912 |  -0.131   1.018   0.026  -2.765 |
## effect on health      0.000  -0.067 |   1.762  67.420   0.876  16.184 |
## No.effect on health   0.000   0.067 |  -0.497  19.016   0.876 -16.184 |
## Not.sophisticated     0.075  -4.739 |  -0.285   2.271   0.032  -3.099 |
## sophisticated         0.075   4.739 |   0.113   0.898   0.032   3.099 |
## Not.spirituality      0.248  -8.612 |  -0.004   0.001   0.000  -0.096 |
## spirituality          0.248   8.612 |   0.008   0.002   0.000   0.096 |
## friends               0.459 -11.716 |   0.160   1.640   0.048   3.787 |
## Not.friends           0.459  11.716 |  -0.301   3.092   0.048  -3.787 |
## 
## Categorical variables (eta2)
##                       Dim.1 Dim.2 Dim.3  
## Sport               | 0.377 0.003 0.026 |
## effect.on.health    | 0.033 0.000 0.876 |
## sophisticated       | 0.398 0.075 0.032 |
## spirituality        | 0.204 0.248 0.000 |
## friends             | 0.054 0.459 0.048 |
## sex                 | 0.282 0.265 0.032 |

3.5 Analyzing the MCA’s summary table

  • The eigenvalues show the amount of variance captured by the different dimensions. Here we can see the following.
    • Dimension 1 captures 22.5 % of the variance within the data
    • Dimension 2 captures 17.5 % of the variance within the data
    • The first two dimensions capture approximately 40 % of the variance within the data. This amount is considerably lower than the amount captured by the first two principal components in the previous analysis of the human dataset.
  • The individuals
    • This part shows the individuals’ coordinates, their contributions (%) to the dimensions and the cos2 (the squared correlations) on the dimensions.
  • The categories
    • This part shows the coordinates of the variable categories, their contributions (%), the cos2 (the squared correlations) and the v.test values.
  • The categorical variables
    • This part shows the squared correlation (eta2) between each variable and the dimensions
    • A value close to one indicates a strong link between the variable and the dimension. Here (among the three dimensions shown), only effect.on.health on Dimension 3 has a value close to one (0.876)
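The figures in the summary table can be cross-checked with a few lines of base R. The sketch below types in the rounded values printed above (so small rounding differences are expected) and uses two standard properties of MCA: the total inertia equals the number of categories divided by the number of variables, minus one, and each eigenvalue equals the mean of the eta2 values on that dimension.

```r
# Rounded eigenvalues copied from the summary above
eig <- c(0.225, 0.175, 0.169, 0.159, 0.140, 0.132)

# 6 variables with 2 categories each give 12 categories in total
n_vars <- 6
n_cats <- 12
total_inertia <- n_cats / n_vars - 1   # equals 1 here

# Percentage of variance per dimension and its cumulative sum
pct     <- 100 * eig / sum(eig)
cum_pct <- cumsum(pct)
round(cum_pct, 1)

# eta2 values copied from the "Categorical variables" part of the summary
eta2 <- rbind(
  Sport            = c(0.377, 0.003, 0.026),
  effect.on.health = c(0.033, 0.000, 0.876),
  sophisticated    = c(0.398, 0.075, 0.032),
  spirituality     = c(0.204, 0.248, 0.000),
  friends          = c(0.054, 0.459, 0.048),
  sex              = c(0.282, 0.265, 0.032)
)

# The column means of eta2 reproduce the first three eigenvalues (up to rounding)
round(colMeans(eta2), 3)
```

Because the total inertia is 1 in this case, the eigenvalues can be read directly as proportions of variance, which is why 0.225 and 22.5 % tell the same story.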

3.6 Visualizing the Multiple Correspondence Analysis with the plot() function

3.7 Some conclusions of the MCA and its plot

  • There’s a clear difference in the social role of tea between females and males. Men prefer drinking tea alone more often than women and women tend to drink tea with friends more often than men.

This proportional barplot confirms that women drink tea more with friends than men do (which was also suggested by the MCA-plot above).
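The proportions behind such a barplot could be computed along the following lines. The data frame `demo` is a made-up mini-sample used only to show the mechanics; its counts are hypothetical and do not come from the tea data.

```r
# Hypothetical mini-sample; the real counts come from the tea data
demo <- data.frame(
  sex     = factor(c("F", "F", "F", "F", "M", "M", "M", "M")),
  friends = factor(c("friends", "friends", "friends", "Not.friends",
                     "friends", "Not.friends", "Not.friends", "Not.friends"))
)

# Cross-tabulate sex by friends, then convert each row (sex) to proportions
counts <- table(demo$sex, demo$friends)
props  <- prop.table(counts, margin = 1)
props
```

Passing `t(props)` to `barplot()` with `legend.text = TRUE` would then draw one stacked bar per sex, each summing to one.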

The above MCA-biplot showed that women regard tea drinking as sophisticated more often than men do. This finding is confirmed in the barplot below.